Load the package and the data
library(RColorBrewer) ## nicer color schemes
library(xcms) ## the package doing the job
library(tidyverse) ## potentially useful
library(knitr)
library(plotly)
This library is used only to get some raw data to play with …
library(faahKO)
we will analyze a subset of the data from in which the metabolic consequences of knocking out the fatty acid amide hydrolase (FAAH) gene in mice was investigated. The raw data files (in NetCDF format) are provided with the faahKO data package. The data set consists of samples from the spinal cords of 6 knock-out and 6 wild-type mice. Each file contains data in centroid mode acquired in positive ion mode form 200-600 m/z and 2500-4500 seconds. To speed up processing of this vignette we will restrict the analysis to only 8 files and to the retention time range from 2500 to 3500 seconds.
Note A large part of this tutorial is taken from the official vignette of xcms. Many thanks to Steffen Neumann and Juhannes Rainer!
We start getting some raw data into R.
## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
recursive = TRUE)[c(1, 2, 5, 6, 7, 8, 11, 12)]
cdfs
## [1] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.1/faahKO/cdf/KO/ko15.CDF"
## [2] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.1/faahKO/cdf/KO/ko16.CDF"
## [3] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.1/faahKO/cdf/KO/ko21.CDF"
## [4] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.1/faahKO/cdf/KO/ko22.CDF"
## [5] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.1/faahKO/cdf/WT/wt15.CDF"
## [6] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.1/faahKO/cdf/WT/wt16.CDF"
## [7] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.1/faahKO/cdf/WT/wt21.CDF"
## [8] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.1/faahKO/cdf/WT/wt22.CDF"
As you can see we have four injections belonging to two classes, ko and wt.
In the last few years the xcms developer has been making a big effort to make their package coherent with a general framework for the handling of MS data in R (metabolomics, proteomics, …)
All this goes beyond the scope of our course, for us is sufficient to know that this infrastructure allows to store sample “metadata” (e.g. treatment class, time point, etc) together with the raw experimental data.
In our specific case, the tibble with the phenotype data could be designed as follow
phenodata <- tibble(sample_name = sub(basename(cdfs), pattern = ".CDF",
replacement = "", fixed = TRUE),
sample_group = c(rep("KO", 4), rep("WT", 4)))
phenodata
## # A tibble: 8 × 2
## sample_name sample_group
## <chr> <chr>
## 1 ko15 KO
## 2 ko16 KO
## 3 ko21 KO
## 4 ko22 KO
## 5 wt15 WT
## 6 wt16 WT
## 7 wt21 WT
## 8 wt22 WT
Up to now nothing has been actually loaded into R. To do that.
raw_data <- readMSData(files = cdfs,
pdata = new("NAnnotatedDataFrame", phenodata), ## this is the structure of xcms holding phenotypic data
mode = "onDisk") ## with this parameter the data are not loaded into RAM
Loading the full dataset into RAM can be problematic for large studies (we will see this in the specific demo) so with this specific on disk mode the raw data are still staying on the disk
We next restrict the data set to the retention time range from 2500 to 3500 seconds, just to spare some time …
raw_data <- filterRt(raw_data, c(2500, 3500))
In the general presentation we have clearly stated that one should be able to visualize the raw data, so let’s give a look to the structure of the r object we created
The raw_data object contains the full set of 2D data collected in all my 8 samples. The “raw” values can be extracted by using
rt <- rtime(raw_data)
mz <- mz(raw_data)
I <- intensity(raw_data)
Let’s look to the structure of these three objects
glimpse(rt)
## Named num [1:5112] 2501 2503 2505 2506 2508 ...
## - attr(*, "names")= chr [1:5112] "F1.S0001" "F1.S0002" "F1.S0003" "F1.S0004" ...
These are the retention time in seconds for the chromatography of all the 8 files. “F1.S0001” stands for File1, scan number 1 … it was recorded at 2501 seconds …
Another way to see that
plot(rt)
length(rt)
## [1] 5112
Where we see the increasing time scale for each file.
mz and I holds the mass spectra collected at each scantime … for this reason the two objects are lists and not vectors. Remember our data are 2d. For each scantime we have a complete mass spectrum
## only the first 20
glimpse(mz[1:20])
## List of 20
## $ F1.S0001: num [1:429] 200 201 202 203 204 ...
## $ F1.S0002: num [1:431] 200 201 202 203 204 ...
## $ F1.S0003: num [1:431] 200 201 202 203 204 ...
## $ F1.S0004: num [1:427] 200 201 202 203 204 ...
## $ F1.S0005: num [1:422] 200 201 202 203 204 ...
## $ F1.S0006: num [1:433] 200 201 202 203 204 ...
## $ F1.S0007: num [1:430] 200 201 202 203 204 ...
## $ F1.S0008: num [1:425] 201 202 203 204 205 ...
## $ F1.S0009: num [1:434] 201 202 203 204 205 ...
## $ F1.S0010: num [1:425] 201 202 203 204 205 ...
## $ F1.S0011: num [1:435] 201 202 203 204 205 ...
## $ F1.S0012: num [1:431] 201 202 203 204 205 ...
## $ F1.S0013: num [1:438] 201 202 203 204 205 ...
## $ F1.S0014: num [1:445] 201 202 203 204 205 ...
## $ F1.S0015: num [1:429] 201 202 203 204 205 ...
## $ F1.S0016: num [1:427] 201 202 203 204 205 ...
## $ F1.S0017: num [1:428] 201 202 203 204 205 ...
## $ F1.S0018: num [1:431] 201 202 203 204 205 ...
## $ F1.S0019: num [1:435] 201 202 203 204 205 ...
## $ F1.S0020: num [1:425] 201 202 203 204 205 ...
## only the first 20
glimpse(I[1:20])
## List of 20
## $ F1.S0001: num [1:429] 1716 1723 2814 1961 667 ...
## $ F1.S0002: num [1:431] 1502 1930 2893 1753 699 ...
## $ F1.S0003: num [1:431] 1453 1828 3145 1521 728 ...
## $ F1.S0004: num [1:427] 1650 1655 3499 1414 656 ...
## $ F1.S0005: num [1:422] 1807 1589 3686 1285 590 ...
## $ F1.S0006: num [1:433] 1698 1732 3778 1219 626 ...
## $ F1.S0007: num [1:430] 1486 2089 3938 1236 893 ...
## $ F1.S0008: num [1:425] 2280 4256 1333 1135 1628 ...
## $ F1.S0009: num [1:434] 2374 4358 1318 1226 1560 ...
## $ F1.S0010: num [1:425] 2539 4160 1226 1131 1453 ...
## $ F1.S0011: num [1:435] 3043 3844 1144 1002 1466 ...
## $ F1.S0012: num [1:431] 3730 3943 1211 858 1432 ...
## $ F1.S0013: num [1:438] 4521 4384 1318 779 1281 ...
## $ F1.S0014: num [1:445] 5422 4793 1439 733 1249 ...
## $ F1.S0015: num [1:429] 6319 4884 1534 782 1219 ...
## $ F1.S0016: num [1:427] 7074 4872 1677 871 1171 ...
## $ F1.S0017: num [1:428] 7600 4702 1624 985 1157 ...
## $ F1.S0018: num [1:431] 7726 4443 1476 1013 1150 ...
## $ F1.S0019: num [1:435] 7646 4142 1381 936 1144 ...
## $ F1.S0020: num [1:425] 7512 3877 1308 804 1196 ...
We can plot a complete spectrum (here the first scan of the first sample …)
plot(x = mz$F1.S0001,I$F1.S0001, type = "h", main = names(mz)[1])
xcms provides tools to visualize and play with the raw data, in a more direct way
## the size of our raw data
length(raw_data)
## [1] 5112
Is exactly the number of scans and
raw_data[[1]]
## Object of class "Spectrum1"
## Retention time: 41:41
## MSn level: 1
## Total ion count: 429
## Polarity: -1
is indeed a spectrum, which has a plotmethod
plot(raw_data[[1]]) +
scale_y_sqrt() +
theme_light()
This is a sort of gg stuff so if we want an interactive stuff we can rely on the plotly package, and also change some of the characteristics of the graphicla layout
ggplotly((plot(raw_data[[1]])))
Ok, working with all files together is not the best … for visualization and handling. To facilitate the “cutting” by file, xcms is provided with a split() function which can be combined with a fromFile function to create a list with the content separate by file
single_raw <- split(raw_data, fromFile(raw_data))
and each element of the list is now a single raw data
single_raw[[1]]
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.25 Mb
## - - - Spectra data - - -
## MS level(s): 1
## Number of spectra: 639
## MSn retention times: 41:41 - 58:20 minutes
## - - - Processing information - - -
## Data loaded [Thu Nov 4 11:14:41 2021]
## Filter: select retention time [2500-3500] and MS level(s), 1 [Thu Nov 4 11:14:41 2021]
## MSnbase version: 2.18.0
## - - - Meta data - - -
## phenoData
## rowNames: 1
## varLabels: sample_name sample_group
## varMetadata: labelDescription
## Loaded from:
## ko15.CDF
## protocolData: none
## featureData
## featureNames: F1.S0001 F1.S0002 ... F1.S0639 (639 total)
## fvarLabels: fileIdx spIdx ... spectrum (35 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
As we discussed, metabolites are visible as peaks ia 2d mz/rt plane …
mytibble <- tibble(rt = rtime(single_raw[[1]]), mz = mz(single_raw[[1]]), I = intensity(single_raw[[1]]))
mytibble
## # A tibble: 639 × 3
## rt mz I
## <dbl> <named list> <named list>
## 1 2501. <dbl [429]> <dbl [429]>
## 2 2503. <dbl [431]> <dbl [431]>
## 3 2505. <dbl [431]> <dbl [431]>
## 4 2506. <dbl [427]> <dbl [427]>
## 5 2508. <dbl [422]> <dbl [422]>
## 6 2509. <dbl [433]> <dbl [433]>
## 7 2511. <dbl [430]> <dbl [430]>
## 8 2512. <dbl [425]> <dbl [425]>
## 9 2514. <dbl [434]> <dbl [434]>
## 10 2515. <dbl [425]> <dbl [425]>
## # … with 629 more rows
And now a fancy plot ….
jet.colors <- colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan", "#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))
test <- mytibble %>%
unnest(c("mz","I")) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = log10(I)), size = 1) +
scale_color_gradientn(colours = jet.colors(7)) +
theme_light()
test
test +
xlim(2500,2750) +
ylim(355,380)
The second thing we would like to visualize is the Extracted ion trace
## Define the rt and m/z range of the peak area
rtr <- c(2700, 2900)
mzr <- c(334.9, 335.1)
## extract the chromatogram
chr_raw <- chromatogram(raw_data, mz = mzr, rt = rtr)
plot(chr_raw)
So we are able to actually see the chromatographic peak of the m/z 335
When one is dealing with the initial investigation of the data the first thing to do is to look to the total ion current of each chromatogram or to the base peak ion chromatogram
## Get the total ion chromatograms. This reads data from the files.
tics<- chromatogram(raw_data, aggregationFun = "sum")
## Define colors for the two groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:2], "60")
names(group_colors) <- c("KO", "WT")
## Plot all chromatograms.
plot(tics, col = group_colors[raw_data$sample_group])
## raw_data$sample_group extracts the info on the phenotipic data inside the raw_data
As you can see, the results are different with common look and feel
This visualization already shows you how it will be tricky to “match” the different samples. Some of the peaks are present everywhere, but others show-up only in specific samples …
The overall integral of the signal in each sample is often used as a way to spot unexpected analytical drifts
## here we rely on the old (and efficient) R style
total_I <- sapply(tics, function(x) sum(intensity(x)))
plot(total_I, col = factor(raw_data$sample_group), pch = 19, cex = 2)
Here the response is comparable, but if this would be the injection order, we should highlight the absence of a proper randomization of the samples! An important things to look at here is the TIC of QCs, they should be comparable!
I know that a specific metabolite present in my samples yields an ion @mz335 … let’s look to the profile of this ion signal over the chromatographic time
## here we get the traces ... compare the function with the one used for the TICs
ion_I_know <- chromatogram(raw_data, mz = c(334.9, 335.1))
plot(ion_I_know, col = group_colors[raw_data$sample_group])
The previous plot is important. It is telling us that the metabolite_I_know is present in the sample and is released by the chromatographic column at around 2800 sec … There it is producing a peal in the signal of the ion_I_know @mz335
To automatically find metabolites in my data I have to teach a computer to look for peaks in the chromatographic traces of all possible ions
bin - see ?bin for details)The “older” and most sounding way of finding peaks implemented in xcms is the matched filter algorithm.
A full description of the parameters of the algorithm can be found in the xcms manual, here we focus on
In xcms the parameters of the algorithm are stored into a specific object
mf <- MatchedFilterParam(binSize = 0.1,
fwhm = 30 ,
snthresh = 6)
mf
## Object of class: MatchedFilterParam
## Parameters:
## - binSize: [1] 0.1
## - impute: [1] "none"
## - baseValue: numeric(0)
## - distance: numeric(0)
## - fwhm: [1] 30
## - sigma: [1] 12.73994
## - max: [1] 5
## - snthresh: [1] 6
## - steps: [1] 2
## - mzdiff: [1] 0.6
## - index: [1] FALSE
Now I can use the previous parameters to find the peaks in one sample
first_peaks <- findChromPeaks(single_raw[[1]],param = mf)
The actual list of peaks can be extracted from the previous object by the method chromPeaks
Let’s look to the head of the output
first_peak_table <- chromPeaks(first_peaks)
dim(first_peak_table)
## [1] 464 13
head(first_peak_table, 5)
## mz mzmin mzmax rt rtmin rtmax into intf maxo maxf i sn sample
## CP001 200.1000 200.1 200.1 2928.610 2912.961 2942.695 147887.5 290507.1 9687 15899.06 1 9.615395 1
## CP002 201.0638 201.0 201.1 2531.112 2515.463 2549.892 204572.4 273087.9 7726 13172.68 1 9.050401 1
## CP003 205.0000 205.0 205.0 2784.635 2770.550 2800.284 1778568.9 3610061.8 84280 195026.47 1 48.163309 1
## CP004 205.9819 205.9 206.0 2786.200 2772.115 2800.284 237993.6 448580.2 10681 23860.10 1 24.225607 1
## CP005 207.0821 207.0 207.1 2712.647 2698.562 2726.731 380873.0 730981.3 18800 40065.74 1 18.000238 1
The first two numbers are telling us that with the setting we have been choosing we were able to find 464 peaks
The help of xcms describes the most relevant columns of the table
“mz” (intensity-weighted mean of mz values of the peak across scans/retention times), “mzmin” (minimal mz value), “mzmax” (maximal mz value), “rt” (retention time of the peak apex), “rtmin” (minimal retention time), “rtmax” (maximal retention time), “into” (integrated, original, intensity of the peak), “maxo” (maximum intensity of the peak), “sample” (sample index in which the peak was identified)
This is the map of the identified peaks superimposed to the “real” experimental data
peaks_tibble <- as_tibble(first_peak_table)
tibble(rt = rtime(single_raw[[1]]), mz = mz(single_raw[[1]]), I = intensity(single_raw[[1]])) %>%
unnest(c("mz","I")) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = log10(I)), size = 0.1) +
geom_point(data = peaks_tibble, mapping = aes(x = rt, y = mz), col = "red", alpha = 0.9) +
scale_color_gradientn(colours = jet.colors(7)) +
theme_light()
Peak picking can also be performed with another algorithm: CentWave
cwp <- CentWaveParam(peakwidth = c(20, 80),
ppm = 30,
prefilter = c(3, 5000))
cwp
## Object of class: CentWaveParam
## Parameters:
## - ppm: [1] 30
## - peakwidth: [1] 20 80
## - snthresh: [1] 10
## - prefilter: [1] 3 5000
## - mzCenterFun: [1] "wMean"
## - integrate: [1] 1
## - mzdiff: [1] -0.001
## - fitgauss: [1] FALSE
## - noise: [1] 0
## - verboseColumns: [1] FALSE
## - roiList: list()
## - firstBaselineCheck: [1] TRUE
## - roiScales: numeric(0)
## - extendLengthMSW: [1] FALSE
Also here many parameters (and others are not mentioned). I highlight here some of them
If we run the peak picking with this new algorithm …
first_peaks_cw <- findChromPeaks(single_raw[[1]],param = cwp)
first_peak_table_cw <- chromPeaks(first_peaks_cw)
dim(first_peak_table_cw)
## [1] 353 11
head(first_peak_table_cw, 5)
## mz mzmin mzmax rt rtmin rtmax into intb maxo sn sample
## CP001 453.2 453.2 453.2 2509.203 2501.378 2527.982 1007409.0 1007380.8 38152 38151 1
## CP002 236.1 236.1 236.1 2518.593 2501.378 2537.372 253501.0 228013.2 12957 12 1
## CP003 594.0 594.0 594.0 2601.535 2581.191 2637.529 161042.2 149109.3 7850 13 1
## CP004 577.0 577.0 577.0 2604.665 2581.191 2626.574 136105.2 130646.0 6215 16 1
## CP005 369.2 369.2 369.2 2587.451 2556.151 2631.269 483852.3 483777.1 7215 7214 1
As you see the number of columns is different, but the key infos are there. Remarkably we have been able to find less peaks
peaks_tibble_cw <- as_tibble(first_peak_table_cw)
tibble(rt = rtime(single_raw[[1]]), mz = mz(single_raw[[1]]), I = intensity(single_raw[[1]])) %>%
unnest(c("mz","I")) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = log10(I)), size = 0.1) +
geom_point(data = peaks_tibble_cw, mapping = aes(x = rt, y = mz), col = "red", alpha = 0.7) +
scale_color_gradientn(colours = jet.colors(7)) +
theme_light()
If we superimpose them …
tibble(rt = rtime(single_raw[[1]]), mz = mz(single_raw[[1]]), I = intensity(single_raw[[1]])) %>%
unnest(c("mz","I")) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = sqrt(I)), size = 0.1, alpha = 0.5) +
geom_point(data = peaks_tibble_cw, mapping = aes(x = rt, y = mz), col = "red", alpha = 0.7) +
geom_point(data = peaks_tibble, mapping = aes(x = rt, y = mz), col = "green", alpha = 0.7, shape = 1) +
scale_color_gradientn(colours = jet.colors(7)) +
theme_light()
The difference is striking.
Obviously one could fiddle around with the parameters to look for a more coherent picture, but the difference is not unexpected considering the fact that we are dealing with two different approaches
When we are satisfied with a set of peak picking parameters, the algorithm will be sequentially run on all the files of the dataset resulting in a large list of peaks assigned to the different samples.
xdata <- findChromPeaks(raw_data, param = cwp)
Here a table of the peaks found in all files
table(chromPeaks(xdata)[,"sample"])
##
## 1 2 3 4 5 6 7 8
## 353 426 191 211 382 254 216 280
An overall representation of their distribution in the plane is extremely interesting
chromPeaks(xdata) %>%
as_tibble() %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = into), size = 0.3) +
facet_wrap(~sample) +
theme_light()
As you can see the samples are different, but the overall “look and feel” is coherent. This is telling us that the overall analytical run was good.
I was mentioning retention time shifts …
chromPeaks(xdata) %>%
as_tibble() %>%
filter(sample %in% c(1,8)) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = factor(sample)), siaze = 0.3) +
scale_color_brewer(palette = "Set1") +
theme_light()
The shift is clear! It is small, but it is visible. the shift is responsible of a difference in the samples coming from the analysis and not the biology. This shift has to be corrected to avoid biased results
xcms can do much more to browse and characterize the peaks, but here we want to focus on the key ideas.
In summary:
The alignment step, also referred to as retention time correction, aims at adjusting this by shifting signals along the retention time axis to align the signals between different samples within an experiment.
Also here a plethora of approaches is available. As usual, everything will work better if the chormatography is more reproducible (for GC, for example, retention time correction is often not necessary).
In xcms the most used and reliable method for alignment of high resolution experiments is based on the obiwarp approach. The algorithm was developed for proteomics and is based on dynamic time warping.
The alignment is performed directly on the profile-matrix and can hence be performed independently of the peak detection or peak grouping.
xdata <- adjustRtime(xdata, param = ObiwarpParam(binSize = 0.2))
It is of utmost importance to check the amount of correction since large time shifts are not reasonable
plotAdjustedRtime(xdata, col = group_colors[xdata$sample_group])
The previous plot shows the extent of correction of the different samples (one is not corrected since is considered the reference).
As you can see the correction is never bigger than 15 seconds. With a chromatographic peak width of around 30 seconds this is more than acceptable and, another time it speaks of a overall good analytical reproducibility
xdata now still contains the list of the peaks for the different samples, but now they retention time should be less erratic …
chromPeaks(xdata) %>%
as_tibble() %>%
filter(sample %in% c(1,8)) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = factor(sample)), siaze = 0.3) +
scale_color_brewer(palette = "Set1") +
theme_light()
As you can see the situation has improved and some of the vertical stripes are now well aligned.
The last step is to find a consensus list of variable across the different samples, **these will be the features which will. The list of peaks is now aligned in retention time but:
The common way of doing this step in xcms relies in a density based approach discussed in the slides.
The algorithm combines chromatographic peaks depending on the density of peaks along the retention time axis (all peaks found in all samples together!) within small slices along the mz dimension.
Car should be taken to account for the fact that a peak could be absent in a sample or in a set of samples and to avoid, in the meantime, to keep peaks found only in one sample.
As before, the parameters of this step are included in a specific object
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
minFraction = 0.5,
bw = 30,
binSize = 0.3)
A set of peaks will be considered a candidate to become a “valid” group if it contains peaks coming from at least a minFraction of samples belonging to one of the sampleGroups.
An example will make this more clear. Suppose I have a dataset with two sample groups: one of 4 samples, the other of 6. If I set minFraction to 0.5, a group of peaks will be considered a feature if it contains at least:
… or more.
Grouping is finally performed with
xdata <- groupChromPeaks(xdata, param = pdp)
The features are now the variables which will show-up in the data matrix. Their definition has been added by the groupChromPeaks method to the xdata object (which also contains the definition of the peaks of the different samples)
The definition can be extracted as a dataframe
myfeatures <- featureDefinitions(xdata)
head(myfeatures, 5)
## DataFrame with 5 rows and 11 columns
## mzmed mzmin mzmax rtmed rtmin rtmax npeaks KO WT peakidx ms_level
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <list> <integer>
## FT001 200.1 200.1 200.1 2925.48 2875.80 2930.17 6 2 3 448,1035,1290,... 1
## FT002 205.0 205.0 205.0 2789.68 2781.90 2795.53 8 4 4 74,423,828,... 1
## FT003 206.0 206.0 206.0 2789.33 2782.48 2793.99 7 3 4 48, 822,1020,... 1
## FT004 207.1 207.1 207.1 2719.67 2714.55 2725.94 8 4 3 30, 36,393,... 1
## FT005 219.1 219.0 219.1 2520.70 2515.89 2526.17 4 2 2 356, 974,1185,... 1
The table contains:
mzmed,mzmin,mzmax,rtmed,rtmin,rtmax)npeaksThe (almost) final untargeted data matrix can be extracted from the same object with
DM <- featureValues(xdata, value = "into")
dim(DM)
## [1] 287 8
head(DM)
## ko15.CDF ko16.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF wt21.CDF wt22.CDF
## FT001 NA 506848.88 NA 169955.6 197063.6 NA 169273.9 109764.03
## FT002 1924712.0 1757150.96 1383416.7 1180288.2 2129885.1 1634342.0 1623589.2 1354004.93
## FT003 213659.3 NA 151850.9 178285.7 253825.6 241844.4 240606.0 185399.47
## FT004 349011.5 451863.66 343897.8 208002.8 364609.8 360908.9 NA 221937.53
## FT005 NA 75563.88 NA 107348.5 223951.8 NA NA 84772.92
## FT006 286221.4 NA 137261.0 149097.6 255697.7 311296.8 181640.2 271128.02
The intensity used to build the data matrix is normally chosen as
into: integrated, original, intensity of the peakmaxo: maximum intensity of the peakIn our simple example we have 309 variables measured over 8 samples
head(DM)
## ko15.CDF ko16.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF wt21.CDF wt22.CDF
## FT001 NA 506848.88 NA 169955.6 197063.6 NA 169273.9 109764.03
## FT002 1924712.0 1757150.96 1383416.7 1180288.2 2129885.1 1634342.0 1623589.2 1354004.93
## FT003 213659.3 NA 151850.9 178285.7 253825.6 241844.4 240606.0 185399.47
## FT004 349011.5 451863.66 343897.8 208002.8 364609.8 360908.9 NA 221937.53
## FT005 NA 75563.88 NA 107348.5 223951.8 NA NA 84772.92
## FT006 286221.4 NA 137261.0 149097.6 255697.7 311296.8 181640.2 271128.02
Note that DM holds samples in columns and variables in rows, so it should be transposed to be ready for the standard analysis
So we finally get there, we have our data matrix full of intensities, but (as usual) missing values are not absent …
Another time:
To go on we have to try to fill at least a part of the holes with a reasonable number. We could for sure rely on the strategy we used yesterday for targeted data, but in the case of LC-MS data we could do something more clever.
Well, the table of the definition of the features tells us the chromatographic limits of all features in terms of rtmin and rtmax. A good idea would be to go back to the raw data and use the signal that was in fact measured there to put a reasonable number where the NA shows up
This smart approach (which works well in many cases, even if there are exceptions) is implemented in xcms in the fillChromPeaks function
xdata_filled <- fillChromPeaks(xdata)
Now our filled data matrix looks like this …
DM_f <- featureValues(xdata_filled, value = "into")
head(DM_f)
## ko15.CDF ko16.CDF ko21.CDF ko22.CDF wt15.CDF wt16.CDF wt21.CDF wt22.CDF
## FT001 169215.4 506848.88 103361.16 169955.6 197063.6 101738.0 169273.9 109764.03
## FT002 1924712.0 1757150.96 1383416.72 1180288.2 2129885.1 1634342.0 1623589.2 1354004.93
## FT003 213659.3 315432.31 151850.86 178285.7 253825.6 241844.4 240606.0 185399.47
## FT004 349011.5 451863.66 343897.76 208002.8 364609.8 360908.9 213569.5 221937.53
## FT005 219748.8 75563.88 90903.61 107348.5 223951.8 114830.9 167913.1 84772.92
## FT006 286221.4 272586.54 137260.97 149097.6 255697.7 311296.8 181640.2 271128.02
A clear improvement, isn’t it ? :-)
This data matrix will be the starting point of our statistical analysis …
Furthermore, we can also manually check its peak shape, as well as the integrated area. Within xcms there is the function featureChromatograms() which allows to visualize the EIC of an specific feature, displaying the area integrated by the algorithm:
sample_colors <- group_colors[xdata$sample_group]
ft_chr1 <- featureChromatograms(xdata, features = rownames(DM)[10],
expandRt = 15, filled = FALSE)
ft_chr2 <- featureChromatograms(xdata, features = rownames(DM)[10],
expandRt = 15, filled = TRUE)
par(mfrow = c(1, 2))
plot(ft_chr1, col = group_colors[xdata$sample_group],
peakBg = sample_colors[chromPeaks(ft_chr1)[, "sample"]])
plot(ft_chr2, col = group_colors[xdata$sample_group],
peakBg = sample_colors[chromPeaks(ft_chr2)[, "sample"]])
legend("topright", col = gsub("60", "", group_colors),
legend = names(group_colors), pch = 19)
Another way to check the feature of interest is using the function plotChromPeakDensity():
chr_mzr <- chromatogram(xdata, mz = myfeatures$mzmed[10] + 0.01 * c(-1, 1))
plotChromPeakDensity(chr_mzr, col = group_colors, param = pdp,
peakBg = sample_colors[chromPeaks(chr_mzr)[, "sample"]],
peakCol = sample_colors[chromPeaks(chr_mzr)[, "sample"]],
peakPch = 16)